arXiv:1501.07454v3 [stat.CO] 2 Sep 2015 


Adaptive step size selection for Hessian-based manifold Langevin 


samplers* 


Tore Selland Kleppe^ 
September 3, 2015 


Abstract 

The usage of positive definite metric tensors derived from second derivative information in the context 
of the simplified manifold Metropolis adjusted Langevin algorithm (MALA) is explored. A new adaptive 
step size procedure that resolves the shortcomings of such metric tensors in regions where the log-target 
has near zero curvature in some direction is proposed. The adaptive step size selection also appears 
to alleviate the need for different tuning parameters in transient and stationary regimes that is typical 
of MALA. The combination of metric tensors derived from second derivative information and adaptive 
step size selection constitute a large step towards developing reliable manifold MCMC methods that can 
be implemented automatically for models with unknown or intractable Fisher information, and even for 
target distributions that do not admit factorization into prior and likelihood. Through examples of low 
to moderate dimension, it is shown that proposed methodology performs very well relative to alternative 
MCMC methods. 

Keywords: Adaptive step size, Hamiltonian Monte Carlo, Manifold Langevin, MCMC, Modified Cholesky 
algorithm 

1 Introduction 

Suppose 7f(x) is a target density kernel where x € so that 7r(x) = 7f(x)/ f 7r(x)dx is a probability density 
function. Bayesian analysis of many statistical models necessitates the calculation of integrals with respect 
to 7r(x) when 7r(x) is analytically intractable and only the unnormalized density kernel 7r(x) is available for 
evaluation (see e.g. Gelman et ah, 2014). To approximate such integrals, Markov Chain Monte Carlo (MCMC) 
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has seen widespread use, and the development of better MCMC algorithms that can help researchers tackle 
even larger and more challenging models is still a highly active field (see e.g. Andrieu et ah, 2010; Brooks 
et ah, 2011; Girolami and Calderhead, 2011). 

MCMC relies on constructing a discrete time Markov process {xt} that has 7r(x) as its stationary distribu¬ 
tion. Consequently, an ergodic theorem ensures that, under some technical conditions, integrals on the form 
f g(x)7T(x)dx can approximated as idrig) = ^^r some large T (e.g. Robert and Casella, 2004, 

chapter 6). A very general way of constructing such a Markov process, and that serves as the foundation of 
most MCMC methods, is the Metropolis Hastings (MH) algorithm (Metropolis et ah, 1953; Hastings, 1970). 
Suppose the current state of the Markov process is x* and that q{-\xt) is some proposal probability density 
function. Then the MH algorithm proceeds by proposing x* ~ ( 7 (-|xi), and accepting the new state xt+i to be 
equal to X* with probability a(xt,x*) = min [1, 7r(x*)q’(xt|x*)/(7r(xt)(7(x* |xt))]. With remaining probability 
1 — a{xt,x*), xt+i is set equal to Xj. The MH algorithm is extremely general as only minimal restrictions on 
the proposal density are necessary for {xj} to satisfy detailed balance with 7r(x) as the stationary distribution 
(Robert and Casella, 2004). 

These minimal restrictions have lead to a large number of strategies for choosing proposal distributions 
(see e.g. Liu, 2001; Robert and Casella, 2004). The most widespread strategy is the Gaussian random walk 
proposal, corresponding to g(-|xi) = A/’(-|xt,S) where A/’(x|^, S) denotes the density evaluated at 

X. The covariance matrix E should be chosen according to the target distribution at hand. As a rule of 
thumb, a scale of q{-\xt) that is “large” relative to the scale of the target density leads to a low acceptance 
probability, and thereby a slow exploration of target distribution. At the other end of the spectrum, if the 
scale of 9(-|xt) is “small” relative to the scale of the target, one typically obtain a high acceptance probability 
but the cumulative distance traversed by many accepted steps will still be small. Provided that the variance 
of (j,T{g) exists, both extreme cases lead to the variance of grig) being large for fixed T, whereas intermediate 
choices of scale typically lead to more efficient sampling in the sense that the variance of (trig) is smaller for 
fixed T than the variance at the extreme cases. It should be noted that “large” and “small” scales relative 
to the target are not absolute or even well-defined magnitudes, and depend among many other aspects, on 
the dimension d of the target (see e.g. Gelman et ah, 1996; Roberts and Rosenthal, 2001). An additional 
complication is that many target distributions of interest, e.g. joint posterior distributions of latent variables 
and variance parameter of the latent variables, contain significantly different scaling properties in different 
regions of the support of the target. Consequently, there is often a need for proposal distributions to adapt 
to local properties of the target distribution to achieve efficient sampling. 

Many researchers have explored MCMC methods that exploit local derivative information from the target 
distribution to obtain more efficient sampling. Early contributions in this strand of the literature includes 
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the Metropolis adjusted Langevin algorithm (MALA) (see e.g. Roberts and Tweedie, 1996; Roberts and 
Stramer, 2002). MALA is a special case of the MH algorithm where the proposal distribution is equal to 
a time-discretization of a Langevin diffusion, where the Langevin diffusion has the target distribution as 
its stationary distribution. The drift term in such diffusion, and consequently the mean of the proposal 
distribution, will depend on the gradient of log7r(x). To achieve reasonable acceptance rates, it is often 
necessary to choose the (time-) step size of the proposal distribution rather small, and it has been argued 
that the MALA in this case will revert to a behavior similar to that of the random walk MH (Neal, 2010). 

Another way of constructing MCMC samplers that exploit derivative information is Hamiltonian (or 
Hybrid) Monte Carlo (HMC) (Duane et ah, 1987; Liu, 2001; Neal, 2010; Beskos et ah, 2013; Betancourt 
et ah, 2014; Hoffman and Gelman, 2014). HMC has the potential of producing proposals that are far from 
the current state, while retaining arbitrarily high acceptance rates. The proposals originate from numerically 
integrating a set of ordinary differential equations that involve the gradient of log 7r(x) and it is often necessary 
to evaluate this gradient hundreds of times per proposed state to achieve reasonable acceptance rates while 
retaining proposals that are far from the current state. 

Most implementations of MALA and HMC involve a user-specified scaling matrix (in addition to other 
tuning parameters) to achieve reasonable efficiency. In a landmark paper, Girolami and Calderhead (2011) 
proposes to recast MALA and HMC on a Riemann manifold that respects the local scaling properties of the 
target, and consequently alleviates the need to choose a global scaling matrix. Suppose x are the parameters 
to be sampled and y the observations associated with a Bayesian statistical model that admit the explicit 
factorization into data likelihood p(y|x) and prior distribution p(x), i.e. 7r(x) oc p(y|x)p(x). Then Girolami 
and Calderhead (2011) took their metric tensor to be the matrix 

Ggc(x) = -^^p(y|x) [V^ logp(y|x)] - logp(x), (1) 

namely the Fisher information matrix associated with data likelihood plus the negative Hessian of the log- 
prior. This choice was demonstrated to be highly suited for Langevin and Hamiltonian based algorithms that 
take local scaling properties into account. However, due to the expectation in (1), the Fisher information 
matrix is rarely available in closed form, and the potential for automating the implementation of MCMC 
samplers based on such a metric seems a formidable task. 

In this paper, an alternative approach based on deriving a metric tensor for simplified manifold MALA 
directly from the negative Hessian of log7r(x) is explored. As noted by Girolami and Calderhead (2011) and 
several of the discussants of that paper (see e.g. Sanz-Serna, 2011; Guerrera et ah, 2011; Jasra and Singh, 
2011), this approach admit a high degree of automation via e.g. the usage of automatic differentiation software 
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(Griewank, 2000), but is complicated by the fact that the negative Hessian is typically not positive definite 
over the complete support of the target. Obtaining positive definite local scaling matrices from potentially 
indefinite Hessian matrices for application in modified Newton methods has seen substantial treatment in 
non-convex numerical optimization literature (see section 6.3 of Nocedal and Wright, 1999, for an overview). 
Such approaches typically rely on either modification of the eigenvalues via a full spectral decomposition 
of the Hessian, or some form of modified Cholesky algorithm that produces a factorization of a symmetric 
positive definite matrix that is close to the Hessian in some metric. In the MCMC context, the former 
approach was taken by Martin et al. (2012); Betancourt (2013), whereas in the present paper I advocate the 
latter as it has potential to exploit sparsity of the Hessian. 

The added generality with respect to target densities that can be handled by Hessian-based metric tensors 
(based either on the spectral decomposition or modified Cholesky factorizations) relative to (1) comes at the 
cost of potential pathological behavior in regions of the log-target with near zero curvature in some direction. 
To counteract such effects, the primary contribution of this paper is a new adaptive step size procedure for 
simplified manifold MALA that chooses the step size locally based how well the Hessian-based metric tensor 
reflects the local scaling of the target. The adaptive step size procedure remains active throughout the MCMC 
simulation and at the same time retains the target as the stationary distribution. The adaptive step size 
procedure exploits the well known fact that a MALA update is equivalent to a particular HMC method using 
one time integration step (Neal, 2010), and therefore admits selection of step sizes for simplified manifold 
MALA based on the (dimensionless) energy error of the associated Hamiltonian system. An additional pro 
of the adaptive step size procedure is that it appears to alleviate the need for different tuning parameters 
in transient and stationary regimes that is typical for standard MALA implementations (Christensen et ah, 
2005). 

The remaining of the paper is laid out as follows. Section 2 describes the applied metric tensor and the 
adaptive step size selection, and illustrates the proposed methodology using a pilot example. Section 3 illus¬ 
trates the proposed methodology on two realistic example models, and compares the proposed methodology 
to alternative MCMC methods. Finally section 4 provides some discussion. 

2 Adaptive step size modified Hessian MALA 

Assume that 7r(x) is sufficiently smooth to admit continuous and bounded derivatives up to order 2. I denote 
the gradient of the log-kernel g(x) = Vxlog7r(x) and the Hessian of the log-kernel as i?(x) = Vxlog7r(x). 
The d X d identity matrix is denoted Id. 

I take as vantage point the simplified manifold MALA (sMMALA) or position specific preconditioned 
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MALA of Girolami and Calderhead (2011) (see also Livingstone and Girolami, 2014; Xifara et al., 2014), 
which is a MH method characterized by the proposal distribution 

<?(-|x) = Af (^-Ix + y [G(x)]”^ g{x) , [G(x)]”^^ . (2) 

Here G(x) is a d x d symmetric and positive definite matrix, from now on referred to as the metric tensor, 
and e is a tunable step size. 

2.1 Modified Cholesky based metric tensor 

This section explains a metric tensor based on an adaptation of the modified Gholesky decomposition of Gill 
and Murray (1974), Gill et al. (1981), from now on referred to as GMW. Suppose A G is a symmetric, not 
necessarily positive definite matrix. Given a user-specified small positive scale parameter u, our adaptation 
of GMW (details and algorithm in Appendix A) aims at finding a Gholesky decomposition LL^ where L is 
lower triangular so that 

A = LL^ = A + J, where > umax ( 1, max |Ai ) >0, i = 1,... ,d. (3) 

’ V ’ / 

The matrix J is a diagonal matrix with non-negative elements chosen to ensure that A is positive definite, 
and in particular the design of the GMW Gholesky algorithm ensures that if A is sufficiently positive definite, 
then J = 0. As J is a diagonal matrix and assuming that A has non-zero diagonal elements, the sparsity 
structures of A and A -|- J are identical. This implies that whenever A has a sparsity structure that can 
be exploited using sparse numerical Gholesky factorizations (Davis, 2006), the same applies for the GMW 
factorization. 

The GMW factorization has seen widespread use within the numerical optimization literature as a key 
ingredient in practical Newton optimization algorithms for non-convex optimization problems (Nocedal and 
Wright, 1999), and has also served as the basis for further refinements as in Schnabel and Eskow (1990, 1999). 
For simplicity, I employ a variant of the algorithm given in Gill et al. (1981). 

In the remainder of this paper I work with the modified Hessian metric tensor 

Gmc(x) = L(x)L(x)'^, 

where L(x) is the output from the GMW Gholesky factorization (3) applied to —H{x). In itself, this approach 
for finding a metric tensor is not fail-safe. To see this, consider for instance the d = 1 case, for which the 
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Figure 1: 50,000 iterations of sMMALA with metric tensor (4), u = 0.001 and £ = 0.75 applied to a t- 
distribution with 4 degrees of freedom. It is seen that the chain rarely enters the regions near |x|=2, and 
when it does, it tends to get stuck. An example of this is seen starting in iteration 21347 where the chain is 
stuck for 340 iterations at approximately 1.9529. 

proposed metric tensor reduces to 

Gmc (x) = max(u, |(x) |). (4) 

Typically I take u to be rather small in order not to disturb the proposal mechanism of sMMALA (2) in 
regions where |i7(x)| is small but still contains useful information on the local scaling properties of the target 
distribution. However if left unmodified, a sMMALA method based on (4) would very infrequently move 
into regions near roots of the second derivative (inflection points). To see this, observe that the backward 
transition density ^(xtlx*) occurring in the nominator of the acceptance probability will be 

q{xt\x*) = 0{yGl/s) when G(x*) = u. (5) 

When the method has moved into such a region, the chain will tend to be stuck for many iterations as the 
proposal distribution then has too large variance. 

To illustrate this phenomenon, consider a t-distributed target with 4 degrees of freedom. In this case 
there are two inflection points of the log-density located at |x| = 2. Figure 1 shows 50,000 MCMC iterations 
using sMMALA based on (4) for this target. It is seen that the chain rarely enters the regions close to the 
inflection points, and when it does, the chain tend to get stuck. It is worth noticing that these effects are not 
unique to the modified Cholesky approach. Specifically, (4) would also be the result if the metric tensor was 
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computed by modifying the eigenvalues of a full spectral decomposition as in Martin et al. (2012), and the 
metric tensor would be a soft/regularized absolute value applied to H (x) in the methodology of Betancourt 
(2013). Rather the effects are the cost of the added generality associated with using second order derivative 
information directly rather than via the Fisher information matrix as in Girolami and Calderhead (2011). 
To mitigate these undesirable effects, (5) suggest either that u or e must be chosen adaptively. In this work, 
I focus on the latter as a simple and computationally efficient method can be implemented for this purpose. 

2.2 sMMALA with randomized adaptive selection of e 

In this work, I propose to select the proposal step size as a function of the current state xt and an easily 
simulated auxiliary random variable ~ 7 r(w), namely e = £(xt,W(). The role of w is essentially to 
randomize the step size, and a further motivation for including it will be clear from section 2.3 when the 
functional forms of •) advocated here are discussed. Before that, I introduce the an overarching sampling 
algorithm in the form of a Metropolis within Gibbs sampler based on the sMMALA update where invariant 
distribution will be shown to be the augmented target 

7r(x, w) = 7r(x)7r(w). (6) 

Notice that the algorithm with target ( 6 ) is primarily a construct to make the method time-homogenous and 
thereby enabling easy establishment of convergence results when an adaptive step size selection £(x,w) is 
employed. In practice only the x-subchain is of interest. 

Given the current configuration (xt,Wi) ^ 7 r(x,w), one step of the proposed Metropolis within Gibbs 
sampler (see e.g. Robert and Gasella, 2004, Algorithm A.43) based on sMMALA for the augmented target 
may then be summarized by the steps 

1. Gompute forward step size £/ = £(xi,wt). 

2. Draw proposal xj'_,_i ~ N(xt -|- ^G“^(xdg(x(),£/G“^(xd). 

3. Gompute backward step size Sb = £(x 

4. Gompute MH accept probability 

( ft(x*+i)Af (xt|x*+i -k ^G-1 (x*+i)5(x*+i),£2G-1(x*+i)^ 

o(xt,Xt%i) = min 1 ,--- -5 --- 

7t(xt)A/'(^Xt%i|xt -k ^G-i(x 0 ff(xt),e/G-i(xt)] 

Let Xi_|_i = with probability Q;(xt,Xj^j^) and let Xj+i = x^ with remaining probability. 
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5. Draw Wt+i ~ 7r(w). 


Notice in particular that £f,£b are computed using the same w-argument and therefore steps 1-4 constitute 
a reversible MH step for each w. Based on this observation, I prove following result: 

Proposition 1: Provided that 0 < £(x, w) < oo for each (x, w) in the support o/7r(x, w), the Metropolis 
within Gibbs sampler in steps 1-5 is -irreducible, aperiodic and has 7r(x, w) as invariant distribution. 

The proof is given in Appendix B and relies on the fact that for a Metropolis within Gibbs sampler, the 
parameters of the proposal distribution in the MH step for updating one block (x) may depend on the current 
state of the remaining blocks (w) (see e.g. Gilks et ah, 1994; Robert and Gasella, 2004; Zhang and Sutton, 
2011). Note in particular that steps 1-5 do not fulfill detailed balance for target 7r(x,w). On the other 
hand, a (time-inhomogenous due to changing w) detailed balance holds for each transition of the x-subchain. 
Having established that such adaptive selection of step size does not disturb the invariant distribution of 
the x-subchain for a large class of functions £{■,■), I now consider the specific functional forms of £(•,•) I 
advocate. 

2.3 Adaptive step size selection based on Hamiltonian dynamics 

The method for selection of e uses a well-known duality between MALA and HMG, namely that the proposal 
of MALA corresponds to a single time-integration step of Euclidian metric HMG starting at Xj when the leap 
frog integrator (Leimkuhler and Reich, 2004) is employed (Neal, 2010). Here I propose to choose the step 
size for sMMALA so that the energy error of a single trial step of the corresponding (Euclidian metric) HMG 
method with mass matrix G(xt) is below a tunable threshold. The rationale for such an approach is that the 
absolute energy error provides a measure of time integration error that is comparable across different areas 
of the target density, and that will be large if the local scaling properties of log7r(x) are poorly reflected by 
G(x), as in the student t-pilot example discussed above. Gomputing the absolute energy error of the trial 
step is inexpensive relative to e.g. doing complete trial sMMALA steps for different values of e, with the 
computational cost typically dominated by a few additional gradient evaluations. 

To implement such energy error based selection £(•, •), I first take 7r(w) to be a d-dimensional standard 
Gaussian variable. Given the current configuration of (xt,Wi), I define the position specific dummy Hamil¬ 
tonian 

'H(q(T),p(T)|x0 = -logit(q(T)) + ip(T)^G(xt)-ip(r), (7) 

where q(T) denotes position and p(t) denotes momentum at fictional time t. Then a single leap frog step of 



(time-) size £ starting from q(0) = and p(0) = L(xt)wt is performed: 


P(e/2) = p(0) -f |g(q(0)) = L(xt)wt -f 

q(£) = q(0)-f eG(xt)"^p(£/2) = xt-f yG"^(xt) 5 (xt)-f £L(xt)"^Wt = x*(£|xt,wt), (8) 

p(£) = p(£/ 2)|5(q(£)) = L(xt)wt-f I (g(xt)-f g(x*(£|xt,wt))). 

The trial proposal x*(£|xt, w*) would occur if the standard normal vector wt was used to generate a proposal 
from (2) with current state equal to x* and time step size e. The energy error associated with this trial time 
integration step is given as 


A(£|xt,wt) = 'H(q(0),p(0)|xt)-'H(q(£),p(£)|xt) 

= -log7r(xt)-f log7r(x*(£|xt,Wt)) 

2 

S '-p £ 'll 

—-w. r ——r r 
2 ‘ 8 

where r = L(xt)“^ + 5 (x*(£|xi, wt))). Based on the expression for A(e|xt,wt), a possible method for 

choosing e = £(x,w) would involve the following steps: 

Suppose 7 > 0 is the tunable maximal allowable trial energy error, with lower values of 7 corresponding to 
higher acceptance rates and smaller e. Let 0 < £ < 00 be the maximal step size and let £0 denote the smallest 
positive root in £ of |A(£|x, w)| = 7 . An idealized candidate for choosing e is then £(x,w) = min(£,£o). The 
£ > 0 bound required in Proposition 1 is automatically fulfilled as consequence smoothness assumptions on 
log 7 r(x), and thus such a step size function would lead the sampling algorithm outlined in section 2.2 to have 
the correct invariant distribution. In practice, locating £0 would amount to solving an equation that involves 
the gradient of the log-target numerically, and for computational effectivity reasons I therefore propose a 
method for only approximating £0 in Section 2.4. 

It is worth noticing that the Hamiltonian (7) should be interpreted in the Euclidian metric sense as 
a tool to measure the time integration error around q( 0 ) = xj, p( 0 ) = L(xt)wt when G(xt) is the fixed 
mass matrix. This is relevant for sMMALA as the distribution of proposed q(£)-states is (2) in this case. 
Moreover, it is in contrast to interpreting (7) in a Riemann manifold HMC sense with Hamiltonian say 
— log 7 r(q) -I- ^p^G(q)“^p (which is off target due to a missing i log |G(q)| term (Girolami and Calderhead, 
2011; Betancourt, 2013)). Under the latter interpretation, the energy error is irrelevant for sMMALA as the 
proposed q(£) will not be distributed according to ( 2 ) regardless of the numerical integrator being employed, 
as the time evolution of p will involve the gradient of |p^G(q)“^p with respect to q. In the same vein, 
I notice that min(l, exp(A)) cannot be interpreted as a trial acceptance probability for the sMMALA as 
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Figure 2: 50,000 iterations of sMMALA with metric tensor (4), u = 0.001, £(x, w) = min(l,eo) as defined 
in the text and 7 = 1.0 applied to a t-distribution with 4 degrees of freedom. It is seen that the adaptive 
selection of step size resolves sampling in the problematic regions around the inflection points. 

the sMMALA deviates from being a reversible and volume-preserving discretization of either Hamiltonian 
when the x^-dependence of the mass matrix is taken into account in (7). Rather, 7 should be interpreted 
as a tolerance of local (irreversible) time integration energy error, and in particular its relationship to the 
acceptance probability of the overall (reversible) sMMALA step is non-linear. 

The role of w is to act as typical (standardized) momentum variable that changes in each iteration of 
the overarching MCMC procedure to reflect the variation in the distribution of the proposal (2). A further 
potential candidate would be to integrate out w, e.g. e(x) = / e(x, w) 7 r(w)(iw so that steps 1-4 of section 
2.2 would constitute a time-homogenous and reversible MH step. However even for moderate dimension d 
this becomes computationally intractable, as this integral has to be computed numerically. Further, it is not 
obvious that a deterministic (conditional on x) step size e(x) has clear advantages over a stochastic step size 
e(x,w). 

To illustrate how the adaptive step size selection £:(x, w) = min(l,eo) resolves the shortcomings of sM¬ 
MALA near inflection points, I consider again the t 4 —distribution target considered in section 2.1 with metric 
tensor (4). Figure 2 presents 50,000 MCMC iterations using adaptive step size sMMALA using a maximal 
absolute energy error 7 = 1.0. It is seen that the adaptive step size selection resolves the inefficient sampling 
near the inflection points seen for the fixed step size implementation in Figure 1. Figure 3 shows the actual 
values of the (forward) adaptive step sizes £(xi, wj) as a function of xt, along with their expected value found 
by integrating out w. It is seen that the energy error criterion appear to discriminate very well that the 
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Dynamic e conditional on x 



X 


Figure 3: Values of the (forward) step size £(xt,wj) = min(l,eo) as a function of Xj for the simulation 
presented in Figure 2 are given as dots. The solid line indicates the average adaptive step size calculated as 
f e(x,w)p(w)dw as a function of x. 


metric tensor (4) shows pathological behavior around the inflection points at |x| = 2. 

As pointed out by a reviewer, the above energy error argument can also be applied in non-adaptive 
step size sMMALA methods to locate regions of the support where deficiencies in the metric tensor lead to 
inefficient sampling analogous to what is seen in Figure 1. This is in particularly of interest for d > 3, as then 
regions with inefficient sampling could be difficult to spot using e.g. trace or scatter plots. In practice, using 
energy error for such an end would amount to calculating the energy error associated with each proposed 
state. Specifically, suppose Xj is the current state of a non-adaptive sMMALA method and z ~ 7V(0,/d) is 
the standardized variable used to generate a proposal x* according to (2). The forward (backward) energy 
error A{ (Aj) associated with the proposal x* are given by 




s 


s 

-log7r(xt) -klog7f(x*) - 

e 

-log7r(x*) -f log7r(xt) - -s^r;, - 


.-1 


L(x 


*\T 



2 


G-1(x*)5(x*) 


where r/ = L(xt) ^ {g{xt) + g{x*)) and rt = L{x*) ^ {g{xt) + g{x*)). A large |A{| (|A^|) (e.g. lAj^] > 
5, fc = f,b) will indicate that local scaling properties in the region around x* (x*) are poorly reflected by 
G(xt) (G(x*)) and the user should consider revising the specification of G or at least reduce e. Note in 
particular that A*, k = f,b can be computed in each iteration with minimal additional cost as the involved 
log-kernels, gradients and metric tensors must be computed anyway to form the acceptance probability for 
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Algorithm 1 Practical backtracking line search algorithm. 


1: Set £ 1^1 (or some reasonable e) 

2: Set ^ 10 (or some reasonable energy error threshold). 
3: Set p ^ 0.5 (or some reasonable step size decrement). 

4: for(s = 1,2,... ) 

5: Compute |A(£ 5 |x,w)|. 

6: if(|A(£s|x,w)| >/3) 

7: £s+i pSs- 

8: else 

9: if(|A(£ 5 |x,w)| < 7 ) 

10: Return £(x, w) = e^. 

11: else 

12 : 

13: end if 

14: end if 

15: end for 


proposal X*. 

2.4 Practical implementation and choosing 7 

For practical applications it is not necessary, nor desirable from a computational perspective, to find Sq with 
high precision. Rather a backtracking iterative scheme as exemplified in Algorithm 1 can be used. This 
algorithm is inspired by line searches commonly used numerical optimization (Nocedal and Wright, 1999) 
and the overarching idea is to generate a sequence trial step sizes £ = £1 > £2 > •. ■ until the criterion 
|A(£s|x,w)| < 7 is fulfilled. Algorithm 1 has two types of decrements, where in the case that absolute energy 
error is greater than /?, the next trial step size is a factor p < 1 times the incumbent one. If the absolute 
energy error is smaller than j3, but greater than 7 , the choice of the next trial step size is informed by the 
fact that A(£|x, w) = 0{£^) for small e. More specifically, I let next trial step £s+i be 0.95 times the root in 
£ of As(£) — 7 where As(£) = (£/£s)^|A(£s|x, w)| is the third order monomial that interpolates the observed 
absolute energy error at Eg. The factor 0.95 is included to counteract slow convergence infrequently seen 
when |A(£s|x, w)| is close to 7 . Of course Algorithm 1 constitute only an example of how to implement 
£(x, w), and may be further tuned and refined for each model instance without disturbing the stationary 
distribution as long as it remains the same throughout the MCMC simulation and fulfills the assumptions in 
Proposition 1. The cost of each iteration will typically be dominated by the gradient evaluation needed for 
computing r for each trial e. 

For non-Gaussian targets, 7 needs to be tuned to obtain e.g. the desired acceptance rate or to maximize 


12 





some more direct performance measure such as the expected jumping distance or effective sample size. The 
tuning could be done by e.g. dual averaging during the burn in iterations as in Hoffman and Gelman (2014). 
However, I have found that as a rule of thumb, values of 7 between 1 and 2 tends to produce acceptable 
results for low to moderate-dimensional problems as the ones considered below. In this case, the adaptive 
step size selection typically produces long step sizes (s ~ e) when G contains useful scaling information, but 
acts as safeguard and reduces the step size substantially when G shows some form of pathological behavior. 

It is interesting to consider the impact of dimension d on the proposed adaptive step size selection, and 
consequently on the overall performance of the proposed methodology. For this purpose, I consider any 
d-dimensional non-degenerate IV(/i, S) target, as it is rather straightforward to show that 

F;(x,w) (A(£|x,w)) = d , 

when G(x) = —iJ(x) = 11“^. For any fixed 7, which in the Gaussian case translates to a fixed acceptance 
rate invariant of d, the adaptive step size has leading term behavior £ = 0{d~^^^). Thus for high-dimensional 
target distributions, it is inevitable that adaptive step size sMMALA will revert to random walk behavior, 
and as a consequence I consider primarily low to moderate-dimensional applications as the primary scope. 

3 Illustrations 

This section considers example models and compares the proposed Adaptive step size Modified Hessian 
MALA (AMH-MALA) methodology to alternatives. The GARGH(1,1) model with t-distributed innovations 
considered first is included to highlight the behavior of AMH-MALA for a posterior distribution that exhibit 
strong non-linear dependence structures and has indefinite Hessian in substantial parts of the relevant pa¬ 
rameter space. The Bayesian binary response regressions considered afterwards are included to allow for easy 
comparison with the methodology presented in Girolami and Galderhead (2011). 

To compare the performance of different MGMG methods, I follow Girolami and Galderhead (2011) in esti¬ 
mating effective sample size (ESS) via the initial monotone sequence estimator of Geyer (1992), and in partic¬ 
ular compare the minimum ESS (across the d dimensions of tt) per second GPU-time spent obtaining the sam¬ 
ples. All computations were carried out on a 2014 macbook pro with a 2 GHz Intel Gore i7 processor and 8 GB 
of memory. The code used for the simulations is available at http://www.ux.uis.no/~tore/code/adaptive_langevin/, 
including a GH—h function implementing the dense modified Gholesky factorization that can be called from 
Matlab. 
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Method 

CPU time 
(s) 

ao 

ESS 

ai 

ESS 

/3 

ESS 

V 

ESS 

minimum ESS 
per second 

bayesGARCH 

18.7 

80.0 

52.9 

48.2 

63.9 

(2.46) 

AMH-MALA 

140 

283 

310 

252 

398 

1.79 

AMH-MALA(eig) 

146 

251 

262 

229 

424 

1.51 

HMC 

2653 

4738 

2284 

2503 

4993 

0.86 


Table 1: Effective samples sizes and minimum effective samples sizes per second CPU time for different 
MCMC algorithms applied to the GARCH(1,1) model with t-distributed innovations. Best performances are 
indicated in bold font. All figures are calculated over 5000 MCMC iterations after burn in. Note that the 
bayesGARCH Gibbs sampler is partially written in C, and thus the CPU-time and minimum ESS per CPU 
time are not directly comparable to the remaining figures. 


3.1 GARCH(1,1) model with t-distributed innovations 

As the first realistic illustration I consider sampling the posterior parameters of a GARCH(1,1) model with 
t-distributed innovations (see e.g. McNeil et ah, 2005, Chapter 4) for log-return observations yi,...,yT on 
the form 


Vi = Vi = Y ^ Vi ~ iid i = (9) 

hi = 00 + ciiVi-i + Phi-i, i = 2,..., T, hi = ao- (10) 


The priors are taken from Ardia and Hoogerheide (2010) and are the default priors used in the associated 
bayesGARCH R-package for the same model. Specifically, ao,Q:i,/3 have independent truncated normal priors 
p{ak) cc exp(— 1 fc = 0,1 and p(/3) oc exp(— 1 y^)1{^>o}- The prior for the degrees of freedom 

parameter v \s ‘a truncated exponential, namely p{i’) oc exp(— ^/100)l{,y>2}. For our illustration, I employ a 
data set of Deutschmark vs British Pound (DEM/GBP) foreign exchange rate log-returns at daily frequency, 
which is also included in the bayesGARCH package. The sample covers January 3, 1985 to December 31, 1991, 
and constitute T = 1974 log-return observations. 

The proposed AMH-MALA methodology is compared to the Gibbs sampler implemented in the bayesGARCH 
package (Ardia and Hoogerheide, 2010) and HMC with mass matrix fixed to the identity matrix. For HMC 
I follow Girolami and Calderhead (2011) and choose 100 time integration steps and the integration step size 
was taken to be £ = 0.0075 ± 10% uniform noise to attain acceptance rates of around 80%. In addition 
I also consider AMH-MALA implemented with a metric tensor similar to the one proposed by Betancourt 
(2013) and denote this by AMH-MALA(eig). Specifically, let Xi denote the ith eigenvalue of —H. Then G 
has the same eigenvectors as —H but the Ah eigenvalue is taken as max(|Ai|, 0.001). For both AMH-MALA 
methods, I use 7 = 1.0 and Algorithm 1 with (3 = 10.0, p = 0.5 for adaptive step size selection. For AMH- 
MALA based on the modified Cholesky, I use u = 0.001. HMC and the AMH-MALA methods are applied in 
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MCMC iteration 

Figure 4: Diagnostics for a typical run of AMH-MALA with initial parameters set to the unrealistic values 
log(Q;o) = —10) log(Q;i) = —1, log(/3) = —3 and v = 20. The upper 4 panels are trace plots of parameters 
q;o)Q!i,/ 1 and log(i/) where the logarithm is applied to latter for visual reasons. The 5. panel is a trace plot 
of the forward step size £/. The 6. panel depicts the smallest eigenvalue of with smallest eigenvalue 

for the first 211 iterations being smaller than -25. The last panel shows the log-target target density at the 
current iteration, i.e. log7r(xt). 
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log-transformed parameters ag = log(ao), a'^ = log(ai), /?' = log(/3) and v' — \og{v — 2) to remove potential 
numerical problems related to the truncations imposed by the priors. 

Table 1 provides mean CPU times, mean ESSes and mean minimum ESS per computing time for the 
parameters of the GARCH model (9-10) across 10 independent repeated runs of 5000 samples each. For 
bayesGARCH I used 5000 burn in iterations, whereas the remaining methods used 1000 burn in iterations. 
The reported CPU times excludes burn in. It should be noted that the Gibbs sampler in the bayesGARCH 
package is written partially in C and the computing times are therefore not directly comparable to the 
remaining methods, which are implemented fully in Matlab. It is seen that AMH-MALA produces the most 
effective samples per unit computing time for the methods written in Matlab, and also produces substantially 
better ESSes per iteration than the bayesGARCH. AMH-MALA and AMH-MALA(eig) produces similar results, 
indicating that there is little added value to using full spectral decomposition over the modified Cholesky 
factorization for this situation. 

Figure 4 depicts various diagnostic output for a typical run of AMH-MALA applied to the parameters 
of (9-10). The initial parameters are set to highly ill informed values to illustrate the different behavior of 
AMH-MALA in transient and stationary regimes. It is seen that AMH-MALA takes shorter forward steps 
in the transient regime up to approximately iteration 220. The negative Hessian is severely indefinite for 
these iterations. This shows that choosing step size based on the energy error enables to AMH-MALA to 
make progress in the transient regime with the same tuning parameters as are used in the stationary regime. 
To contrast this, fixing ey = £;, = 1 but otherwise keeping the setup identical to that reported in Figure 4 
results in a single accepted move in the course of the 2000 iterations, indicating the need for different tuning 
parameters in stationary and transient regimes (Christensen et ah, 2005). 

Visual inspection of Figure 4 indicates that small values of £/ are often associated with near zero minimum 
eigenvalues. This can be statistically confirmed as the correlation between log|miniAi| and £/ is 0.63 for 
iterations 220-2000. This indicates that the adaptive step size procedure on average sees both large positive- 
and large negative curvature information as useful and consequently takes long step sizes, whereas it takes 
shorter step sizes when very little curvature information is available in some direction. Still there is a small 
positive correlation of 0.24 between min^ and the acceptance probability, indicating that AMH-MALA 
performs slightly better in regions of the support where the target is log-concave. 

3.2 Bayesian binary response regression 

This section considers Bayesian inference for two types of binary response generalized linear models. The 
models are included in order to compare the proposed methodology with that of Girolami and Calderhead 
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Model 

Method 

CPU time 
(s) 

minimum 

ESS 

median 

ESS 

maximum 

ESS 

minimum ESS 
per second 


Australian credit data set {d 

= 15, n = 690) 


logit 

AMH-MALA 

2.6 

436 

579 

725 

167 

logit 

sMMALA 

1.6 

436 

591 

745 

271 

probit 

AMH-MALA 

3.4 

575 

774 

919 

167 

probit 

sMMALA 

1.8 

491 

665 

869 

280 

probit 

Adaptive sMMALA 

3.3 

475 

652 

809 

142 


German credit data set (d = 

25, n = 1000) 


logit 

AMH-MALA 

3.6 

417 

570 

707 

117 

logit 

sMMALA 

2.1 

432 

616 

751 

202 

probit 

AMH-MALA 

4.5 

579 

747 

886 

129 

probit 

sMMALA 

2.4 

534 

700 

846 

226 

probit 

Adaptive sMMALA 

4.4 

488 

649 

785 

111 



Heart data set (d = 14, 

n = 270) 



logit 

AMH-MALA 

2.0 

362 

468 

579 

176 

logit 

sMMALA 

1.3 

364 

473 

587 

273 

probit 

AMH-MALA 

2.5 

612 

760 

883 

247 

probit 

sMMALA 

1.4 

451 

593 

734 

331 

probit 

Adaptive sMMALA 

2.4 

475 

590 

719 

197 


Pima Indian data set (d = 

8, n = 532) 



logit 

AMH-MALA 

2.2 

1043 

1184 

1296 

479 

logit 

sMMALA 

1.4 

1010 

1174 

1314 

732 

probit 

AMH-MALA 

2.7 

1143 

1282 

1428 

425 

probit 

sMMALA 

1.4 

1089 

1253 

1480 

752 

probit 

Adaptive sMMALA 

2.6 

1116 

1242 

1454 

431 



Ripley data set (d = 7, 

n = 250) 



logit 

AMH-MALA 

1.3 

285 

375 

467 

227 

logit 

sMMALA 

0.6 

291 

387 

464 

506 

probit 

AMH-MALA 

1.8 

387 

536 

623 

212 

probit 

sMMALA 

0.8 

287 

379 

490 

381 

probit 

Adaptive sMMALA 

1.8 

293 

391 

506 

161 


Table 2: Effective sample sizes and CPU times for the Bayesian binary response regressions. Best perfor¬ 
mances are indicated with bold font, “logit” correspond models with logistic link function. In this case 
the negative Hessian and Fisher information matrix coincides, “probit” correspond to the standard normal 
cumulative distribution function as the inverse link function. In this case the negative Hessian is different 
from the Fisher information matrix. sMMALA replicates the simplified manifold MALA of Girolami and 
Calderhead (2011) whereas Adaptive sMMALA uses the metric tensor of Girolami and Calderhead (2011) 
along with adaptive step size selection. 
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(2011). Specifically, I consider models for observed binary responses y = {yi,... ,y„) on the form 

P{yt = P) = P[{X|3)^], i = (11) 

p ~ fV(0, 100/d). 

Here X e is a design matrix and the inverse link function p is specified either as a logit link corresponding 
to p{x) = exp(x)/(l + exp(x)) or the probit link corresponding to p{x) = $(x) where $ denotes the fV(0,1) 
distribution function. For both link functions, the Fisher information matrix is available in closed form, and 
for the logit link the negative Hessian of log-likelihood function associated with (11) coincides with the Fisher 
information matrix, whereas this is not the case for the probit model. However, the negative Hessian is still 
positive definite in the relevant region for the probit model. 

I consider the collection of 5 data sets used by Girolami and Calderhead (2011) where n ranges between 
250 and 1000 and d ranges between 7 and 25. The sMMALA method using Gqc is used as a reference. For 
the logit model, this amounts to a replication of the simplified MMALA-part of the Monte Carlo experiment 
of Girolami and Calderhead (2011), and therefore admits calibration of their results against those presented 
here. In particular, I recoded the method of Girolami and Calderhead (2011) so that all codes use the same 
routines for evaluating likelihoods, gradients and Hessians to make the comparison as fair as possible. For 
AMH-MALA I employed 7 = 2.0 and Algorithm 1 with /3 = 20.0, p = 0.7. In this setting. Algorithm 1 works 
mainly as a safeguard against numerical problems occurring when some p [{XP)i] —)■ 0 or p [(A/3)i] —)■ 1. The 
results are presented in Table 2. Through out, I collect 5000 samples after 5000 iterations of burn in, and 
the timings are for producing the post burn in samples. All experiments are repeated 10 times and reported 
numbers are averages across these replica. In Girolami and Calderhead (2011), the simplified MMALA was 
found to be the best method for the logit model for 4 out 5 data sets when other contenders included the 
Riemann manifold HMC and full manifold MALA and the performance measure was the minimum (over P) 
ESS produced per unit of time. 

For the logit model, AMH-MALA may be interpreted as an adaptive step size version of Girolami and 
Calderhead (2011)’s simplified MMALA, I see that the line search mechanism leads to approximately a factor 
2 increase computing time, whereas the number of effective samples produced per iteration are approximately 
the same. Looking at the results reported in Table 3-7 in Girolami and Calderhead (2011) using simplified 
MMALA as a reference, I find that AMH-MALA performance roughly on par with Riemann manifold HMC 
for this model and these data sets. 

For the probit model, where —H{P) and Ggc do not coincide, I see slightly larger differences in terms of 
effective sample size in favor of AMH-MALA, whereas the relative consumption of CPU time is roughly the 
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same as in the logit case. To further investigate this find, I also implemented a method based on Ggc but 
otherwise identical to AMH-MALA and denote this method Adaptive sMMALA. It is seen that the improved 
ESSes appear to be a feature of the application of the Hessian rather than adaptive step size selection as 
the ESSes of sMMALA and Adaptive sMMALA are roughly the same. From this I may conclude that the 
negative Hessian may be a better alternative than the Fisher information-based metric tensor for models 
where the Hessian is positive definite in all relevant regions. 

4 Discussion 

This paper makes usage of the Gill et al. (1981) modified Cholesky factorizations for producing positive 
definite metric tensors from the Hessian matrix for simplified manifold MALA methods. A new adaptive 
step size procedure that resolves the shortcomings of metric tensors derived from the Hessian in regions 
where the log-target has near zero curvature in some direction is proposed. The adaptive step size selection 
also appears to alleviate the need for different tuning parameters in transient and stationary regimes. The 
combination of the two constitute a large step towards developing reliable manifold MCMC methods that 
can be implemented for models with unknown or intractable Fisher information, and even for targets that 
does not admit a factorization into prior and likelihood. Through examples of low to moderate dimension, it 
is shown that proposed methodology performs very well relative to alternative MCMC methods. 

To handle high-dimensional problems, it is likely that a HMC variant of the proposed methodology 
is needed. One avenue would be to make G(x) a smooth function via the usage of soft-max functions 
(Betancourt, 2013) in Algorithm 2 and implement full Riemann manifold HMC along the lines of Girolami 
and Calderhead (2011). This approach would enable the exploitation of sparsity of the Hessian not afforded 
by methods based on spectral decompositions (Betancourt, 2013), but would still require computing third 
derivatives of logfr. An interesting alternative that is currently being investigated involves embedding HMC 
into a population MCMC framework where each member of the population has the same target. In such 
a setup, one member of the population is expended in each MCMC iteration for calculating a position- 
specific mass matrix and time integration step size using the modified Cholesky factorization and adaptive 
step size selection procedure proposed here. These parameters are then applied in standard HMC updates 
of the remaining population members to mimic the effects of Riemann manifold HMC while retaining a 
computationally attractive separable Hamiltonian. 

Another avenue for further work is to extend the adaptive step size methodology via energy error ar¬ 
guments of Section 2.3 to other MCMC methods, via the observation that other, possibly non-symplectic, 
numerical integration schemes applied to Hamilton’s equations associated with (7) leads to different known 
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proposal distributions. In particular, a symplectic Euler type B integrator (Leimkuhler and Reich, 2004, 

page 26) lead to a N{x. + £^G(x)“^g(x),£^G(x)“^) proposal which nests (for £ = 1) the stochastic Newton 

MCMC method of Martin et al. (2012). A standard Euler integrator lead to a position specific scale random 

walk proposal 7V(x,£^G(x)“^). 
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Algorithm 2 A variant of the modified Cholesky decomposition of Gill et al. (1981). 


Input: d X d symmetric matrix A and a small scale factor u. 

step 0,1: L <— Id- 

step 0,2: Djj -e- Ajj, j = 1,.. .d. 

step 0,3: ly •<— max^ |Ajj |. 

step 0,4: ^ ^ max^^j \ Aij\. 

step 0,5: <p ^r- max(i/, _ 1,^). 

step 0,6: 5 ^r- umax(:^, 1). 


= 1 to 

n 





step 1: 

if(j 

> 

1) ^j,k Tj/j,, k 


- 1. 

step 2: 

if(j 

< 

d) Lj^i-dj •<— Aj^i-dj- 



step 3: 

if(l 

< 

j < d) Lj^i-dj ^ 

d,j — {Lj + l: 


step 4: 

if(j 

< 

d) 6j ^r- maxj<fe<d \Lk,j 

1 else 9j ^r- 

0. 

step 5: 


^ max(d, 



step 6: 

if(j 

< 



= j -t 1,..., d. 


end for 

Return L and D (so that LDL^ = A + J) or L = (so that LL^ = A + J). 


A A version of the Gill et al. (1981) modified Cholesky 

This section recaptures GMW’s square root-free modified Gholesky algorithm for finding lower triangular 
matrix L with L^ i = 1, i = 1,..., d and diagonal matrix D so that 

LDL^ = A + J. 

The conventional lower triangular Gholesky factor given in (3) obtains as L = GMW’s approach 

takes as vantage points: 

1. A uniform lower bound on the diagonal elements of D (or L), i.e. Dj j = > d, j = 1,..., d for a 

small positive constant <5. 

2. A uniform upper bound on the absolute off-diagonal elements oi L = LD^/'^ that ensures that A is 
positive definite and numerically stable, i.e. \Li jyjDj j \ < 4), i = 2,... ,d, j < i. 

By taking cfP = max(i^, ^/Vd^ — l,u) where v and ^ are the maximal absolute diagonal- and off-diagonal 
elements respectively. Gill et al. (1981) show that A is positive definite while at the same time a bound on 
the infinity norm of J is minimized. Throughout this paper, 6 is taken to be umax(:^, 1). The scale factor 
u is commonly taken to be close to machine accuracy for optimization applications, but in this work it is left 
as tuning parameter. 

GMW’s approach for implementing the above bounds is based on a square root free Gholesky factorization 
(Nocedal and Wright, 1999, p. 146) and is given in Algorithm 2. Gill et al. (1981) observed that immediately 
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after step 3 in the jth iteration of the algorithm, the jth column of the unfinished L contains Djj times the 
jth column of the final L. Therefore Dj j can be chosen (step 4 and 5) during the course of the algorithm 
to respect the two bounds given above. The difference in Djj between after and before step 4 in iteration 
j amounts to the jth diagonal element of J, and simply removing steps 4 and 5 corresponds to a standard 
standard square root free Cholesky algorithm. The modifications essentially adds 0{(P) comparisons to the 
computational complexity relative to the standard dense Cholesky algorithm it derives from, and is therefore 
asymptotically insignificant comparing to the 0{(P/6) floating point operations needed to carry out the latter. 


B Proof of Proposition 1 

Irreducibility of the update of x in steps 1-4 is ensured by the fact that the proposal distribution is Gaussian 
with finite and positive definite covariance matrix whenever 0 < £(x, w) < oo for all (x, w) in the support of 
7r(x, w). The update of w in step 5 is trivially irreducible since it amounts to iid sampling. Consequently 
the overall update in steps 1-5 is irreducible as any point on the support of the target is attainable in each 
iteration. 

Aperiodicity of the update of x is ensured by the fact that it involves a non-trivial accept-reject in step 
4 (Robert and Casella, 2004, section 7.3.2) and aperiodicity of the update of w is trivial since it it iid. 
Consequently, the overall update in steps 1-5 is thus aperiodic. 

To verify that 7r(x,w) is the invariant distribution of steps 1-5, first observe that for each given Wj, steps 
1-4 define a reversible MH step with 7r(x) as invariant distribution. The reversibility is consequence of the fact 
that £/,£{, are computed with the same w-argument, and therefore within steps 1-4 of each iteration £(x, w) 
is effectively a function of the first argument only. Denote by A(xi_|_i jx*, w^) the transition kernel associated 
with steps 1-4. Then the overall transition kernel of steps 1-5 may be written as i?(xi_|_i,Wt+i|xi,Wt) = 
A(xi_|_i|xt, Wi)7r(wt_|_i). That 7r(x,w) is the invariant distribution of B is then seen via 


J j 7r(x, w)i?(x', w'|x, w)(fxc?w = 7r(w') J 7r(w) J A(x'|x, w)7r(x)(ixdw, 

= 7r(w')7r(x') = 7r(x', w'). 


where the second equality follows from the reversibility of steps 1-4 for each w. 
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